Fast Structural Alignment of Biomolecules Using a Hash Table, N-Grams and String Descriptors

Authors

  • Raphael André Bauer
  • Kristian Rother
  • Peter Moor
  • Knut Reinert
  • Thomas Steinke
  • Janusz M. Bujnicki
  • Robert Preissner
Abstract

This work presents a generalized approach for the fast structural alignment of thousands of macromolecular structures. The method uses string representations of macromolecular structures and a hash table that stores n-grams of a given size for searching. To this end, structure-to-string translators were implemented for protein and RNA structures. A query against the index is performed in two hierarchical steps to combine speed and precision. In the first step, the query structure is translated into n-grams, and all target structures containing these n-grams are retrieved from the hash table. In the second step, the corresponding n-grams of the query and each target structure are aligned, and after each alignment a score is calculated based on the matching n-grams of query and target. The extensible framework enables the user to query and structurally align thousands of protein and RNA structures on a commodity machine and is available as open source from http://lajolla.sf.net.

Algorithms 2009, 2, 693
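The two-step query described in the abstract can be illustrated with a minimal Python sketch. It is not the LaJolla implementation: the function names, the one-letter descriptor alphabet, and the scoring rule are all assumptions made for illustration. In particular, the second step here only counts matching n-gram positions, whereas the abstract describes aligning the corresponding n-grams before scoring.

```python
from collections import defaultdict


def ngrams(descriptor, n):
    """Yield (position, n-gram) for every window of length n in the string."""
    for i in range(len(descriptor) - n + 1):
        yield i, descriptor[i:i + n]


def build_index(structures, n):
    """Hash table mapping each n-gram to the (structure id, position) pairs
    where it occurs. `structures` maps an id to its string descriptor."""
    index = defaultdict(list)
    for sid, descriptor in structures.items():
        for pos, gram in ngrams(descriptor, n):
            index[gram].append((sid, pos))
    return index


def query(index, descriptor, n):
    """Step 1: retrieve every target sharing an n-gram with the query.
    Step 2 (simplified stand-in): score each target by the fraction of
    query n-gram positions that found a match in that target."""
    hits = defaultdict(set)
    for qpos, gram in ngrams(descriptor, n):
        for sid, _tpos in index.get(gram, ()):
            hits[sid].add(qpos)
    total = max(1, len(descriptor) - n + 1)
    scores = {sid: len(positions) / total for sid, positions in hits.items()}
    # Best score first; ties broken by structure id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Because the index is keyed by n-gram, the first step touches only targets that share at least one n-gram with the query, which is what keeps the candidate set small before any more expensive alignment is attempted.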

Similar resources

Minimum Perfect H Fast N-gram Language

A new technique is proposed for N-gram language model (LM) retrieval based on minimum perfect hashing (MPH). A hierarchical data structure is used to store N-gram scores in hash tables according to the order of N-grams, and a LM score is retrieved by probing the appropriate hash table slot without collision. Both integer key and character-string key based MPH functions are studied. The proposed...

Hash Table Sizes for Storing N-Grams for Text Processing

N-grams have been widely investigated for a number of text processing tasks. However n-gram based systems often labor under the large memory requirements of naïve storage of the large vectors that describe the many n-grams that could potentially appear in documents. This problem becomes more severe as the number of documents (and hence the number of vectors to store and process) rises. A natura...

Hashing and Merging Heuristics for Text Reuse Detection

This paper describes a joint software entry by King Fahd University of Petroleum & Minerals and the University of Sheffield for the text-alignment task at PAN-2014. We employ the three steps of seeding, extension and filtering for text alignment. For seeding we use character n-grams with a variant of the Rabin-Karp algorithm for multiple pattern search. We then use an elaborate merging mechanism...

Densifying One Permutation Hashing via Rotation for Fast Near Neighbor Search

The query complexity of locality sensitive hashing (LSH) based similarity search is dominated by the number of hash evaluations, and this number grows with the data size (Indyk & Motwani, 1998). In industrial applications such as search where the data are often high-dimensional and binary (e.g., text n-grams), minwise hashing is widely adopted, which requires applying a large number of permutat...

Journal:
  • Algorithms

Volume 2

Publication date: 2009